Word embeddings and deep learning models are a new way of preprocessing data. Instead of counting each word and treating the document-feature matrix (dfm) as input, you translate either the dfm or the text directly into an embedding space.
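To make the idea of an embedding space concrete, here is a toy sketch with made-up three-dimensional vectors (real models use hundreds of dimensions, and the words and values here are purely illustrative): words with similar meanings end up close to each other, which is usually measured with cosine similarity.

```python
import math

# made-up 3-dimensional "embeddings"; vectors from a trained model
# would have hundreds of dimensions
embeddings = {
    "movie":  [0.9, 0.1, 0.0],
    "film":   [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: ~1 = very similar, ~0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["movie"], embeddings["film"]))    # high, ~0.98
print(cosine_similarity(embeddings["movie"], embeddings["banana"]))  # low, ~0.01
```

Counting words can never tell you that "movie" and "film" mean roughly the same thing; in an embedding space that similarity is directly encoded in the distances between vectors.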
However, the big advantage of word embeddings, and especially transformer models, is that they make it possible to create enormous language models, trained on billions of texts, that come as close as we have ever been to getting computers to understand the meaning of language. Unfortunately, this also means that large language models are the domain of the richest companies and research facilities and are not easy to create for individual researchers.
Compared to other approaches like Naive Bayes or SVM algorithms, we are also still relatively early in the life of this technology. The step that happened about 10-15 years ago, when many of the established methods were implemented in R, has not really happened yet. Moreover, the models only run well on new, powerful hardware.
So this session is currently more a preview than an actual hands-on tutorial.
R wrappers for large language models
Another problem with LLMs is that they are predominantly controlled from Python. R has excellent wrappers for languages like C, C++, Rust or Java, but its Python wrappers still fall a little behind in terms of comfort of usage. Packages like spacyr and grafzahl employ Python anyway, through the reticulate compatibility layer (though they still have some issues to figure out).
```r
library(tidyverse)
```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.1 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```r
library(rsample) # for initial_split(), training() and testing()
imdb <- readRDS("data/imdb.rds")
set.seed(1)
split <- initial_split(
  data = imdb,
  prop = 3/4,    # the prop is the default, I just wanted to make that visible
  strata = label # this makes sure the prevalence of labels is still the same afterwards
)
imdb_train <- training(split)
imdb_test <- testing(split)
```
```r
saveRDS(model, "8_imdb_distilbert.rds")
# model <- readRDS("8_imdb_distilbert.rds")
```
If you’re here, you probably already know R, so why re-learn things from scratch?

- R is a programming language specifically for statistics, with some great built-in functionality that you would miss in Python.
- R has absolutely outstanding packages for data science with no drop-in replacement in Python (e.g., ggplot2, dplyr, tidytext).

Why not just stick with R then?

- Newer models and methods in machine learning are often Python-only (as advancements are made by big companies that rely on Python).
- You might want to collaborate with someone who uses Python and need to run their code.
- Learning a new (programming) language is always good to extend your skills (also in the language(s) you already know).
Getting started
We start by installing the necessary Python packages, for which you should use a virtual environment (so we set that one up first).
Create a Virtual Environment
Before you load reticulate for the first time, we need to create a virtual environment. This is a folder in your project directory with a link to Python and the packages you want to use in this project. Why?
- Packages (or their dependencies) on the Python Package Index can be incompatible with each other, meaning you can break things by updating.
- Your operating system might keep older versions of some packages around, which means you could break your OS with an accidental update!
- A virtual environment also makes projects more reproducible on other systems, as you keep track of the specific version of each package used in the project (you can do the same in R with the renv package).
To find the correct version of Python to link to in the virtual environment, you can ask your operating system where the Python binaries live:
```r
if (R.Version()$os == "mingw32") {
  system("where python") # for Windows
} else {
  system("whereis python")
}
```
I choose the main Python installation in “/usr/bin/python” and use it as the base for a virtual environment. If you don’t have any Python version on your system, you can install one with reticulate::install_miniconda().
```r
# I built in this if condition to not accidentally overwrite the environment when rerunning the notebook
if (!reticulate::virtualenv_exists(envname = "./python-env/")) {
  reticulate::virtualenv_create(
    "./python-env/",
    python = "/usr/bin/python"
    # for Windows the path is usually "C:/Users/{user}/AppData/Local/r-miniconda/python.exe"
  )
}
reticulate::virtualenv_exists(envname = "./python-env/")
```
[1] TRUE
reticulate is supposed to pick this environment up automatically when started, but to make sure, I set the environment variable RETICULATE_PYTHON to the Python binary in the new environment, e.g. `python_path <- "./python-env/bin/python"; Sys.setenv(RETICULATE_PYTHON = python_path)` (on Windows the binary sits in `python-env/Scripts/python.exe`).
Optional: make this persist across restarts of RStudio by saving the environment variable into an .Renviron file (otherwise the Sys.setenv() line above needs to be in every script):
```r
# open the .Renviron file
usethis::edit_r_environ(scope = "project")
# or directly append it with the necessary line
readr::write_lines(
  x = paste0("RETICULATE_PYTHON=", python_path),
  file = ".Renviron",
  append = TRUE
)
```
Now reticulate should pick up the correct binary in the project folder:
```r
library(reticulate)
py_config()
```
python: /mnt/data/Dropbox/Teaching/r-text-analyse-ffm/python-env/bin/python
libpython: /usr/lib/libpython3.10.so
pythonhome: /mnt/data/Dropbox/Teaching/r-text-analyse-ffm/python-env:/mnt/data/Dropbox/Teaching/r-text-analyse-ffm/python-env
version: 3.10.10 (main, Mar 5 2023, 22:26:53) [GCC 12.2.1 20230201]
numpy: /mnt/data/Dropbox/Teaching/r-text-analyse-ffm/python-env/lib/python3.10/site-packages/numpy
numpy_version: 1.23.5
NOTE: Python version was forced by RETICULATE_PYTHON
Installing Packages
reticulate::py_install() installs packages similarly to install.packages(). Let’s install the packages we need:
```r
reticulate::py_install(c(
  "bertopic", # this one requires some build tools not usually available on Windows; comment out to install the rest
  "sentence_transformers",
  "simpletransformers"
))
```
Recreating grafzahl from Python
To demonstrate the workflow for reticulate, we do the same analysis as above, but rely on Python functions:
```python
import os

import pandas as pd
import torch
from simpletransformers.classification import ClassificationModel

# args copied from grafzahl
model_args = {
    "num_train_epochs": 1, # increase for multiple runs, which can yield better performance
    "use_multiprocessing": False,
    "use_multiprocessing_for_evaluation": False,
    "overwrite_output_dir": True,
    "reprocess_input_data": True,
    "fp16": True,
    "save_steps": -1,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    "silent": True,
}
os.environ["TOKENIZERS_PARALLELISM"] = "false"
roberta_model = ClassificationModel(
    model_type="roberta",
    model_name="roberta-base",
    # Use GPU if available
    use_cuda=torch.cuda.is_available(),
    args=model_args
)
```
We construct a training and test set from the movie review corpus in R:
Now we can train the model on the coded training set and predict the classes for the test set (if you do not have a GPU, this will take a long time, so maybe do it after the course):
```python
# process data to the form simpletransformers needs
train_df = r.imdb_train
train_df['labels'] = train_df['label'].astype('category').cat.codes
train_df = train_df[['text', 'labels']]
roberta_model.train_model(train_df)
# test data needs to be a list
```
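The `.astype('category').cat.codes` step above is what turns the string labels into the integer codes simpletransformers expects. Here is a small standalone illustration with made-up texts and labels (the real data comes from `r.imdb_train`):

```python
import pandas as pd

# made-up example data; in the workflow above this is r.imdb_train
train_df = pd.DataFrame({
    "text": ["great movie", "terrible film", "loved it"],
    "label": ["pos", "neg", "pos"],
})

# categories are sorted alphabetically, so "neg" -> 0 and "pos" -> 1
train_df["labels"] = train_df["label"].astype("category").cat.codes
print(train_df["labels"].tolist())  # [1, 0, 1]
```

The last comment in the chunk above refers to prediction: `roberta_model.predict()` takes a plain Python list of texts, so a pandas column has to be converted with something like `.tolist()` first.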
I use the data_corpus_guardian from quanteda.corpora to show an example workflow for BERTopic. This dataset contains Guardian newspaper articles from the politics, economy, society and international sections from 2012 to 2016.
```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP

# confusingly, this is the setup part
topic_model = BERTopic(
    language="english",
    top_n_words=5,
    n_gram_range=(1, 2),
    nr_topics="auto", # change if you want a specific nr of topics
    calculate_probabilities=True,
    umap_model=UMAP(random_state=42) # make reproducible
)
# and only here we actually run something
topics, doc_topic = topic_model.fit_transform(r.corp_news.texts)
```
Unlike traditional topic models, BERTopic uses an algorithm that automatically determines a sensible number of topics and also automatically labels topics:
Note that -1 describes a trash topic with words and documents that do not really belong anywhere. BERTopic also supplies the top words, i.e., the ones that most likely belong to each topic. In the code above I requested 5 words for each topic:
BERTopic also classifies documents into the topic categories (again, not really how you should use LDA topic models) and provides a nice visualisation of trends over time. Unfortunately, the date format in R does not translate automagically to Python, hence we need to convert the dates to strings:
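Since the R Date vector does not convert cleanly, one option is to do the conversion on the Python side with plain string formatting. A sketch with made-up dates (in the actual workflow these would come from the corpus docvars):

```python
from datetime import date

# made-up publication dates; in the real workflow these come from the corpus docvars
dates = [date(2012, 1, 15), date(2014, 6, 3), date(2016, 12, 24)]

# ISO-formatted strings ("YYYY-MM-DD") are unambiguous and sort correctly
timestamps = [d.isoformat() for d in dates]
print(timestamps)  # ['2012-01-15', '2014-06-03', '2016-12-24']
```

On the R side, the same result can be achieved by running as.character() on the date column before handing the data over, since as.character() on a Date vector also produces "YYYY-MM-DD" strings.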